This case is about a bank (Thera Bank) with a growing customer base. The majority of these customers are liability customers (depositors) with deposits of varying sizes. The number of customers who are also borrowers (asset customers) is quite small, and the bank wants to expand this base rapidly to bring in more loan business and, in the process, earn more through interest on loans. In particular, the management wants to explore ways of converting its liability customers into personal loan customers (while retaining them as depositors). A campaign the bank ran last year for liability customers showed a healthy conversion rate of over 9%. This has encouraged the retail marketing department to devise campaigns with better-targeted marketing to increase the success ratio on a minimal budget.
The department wants to build a model that will help them identify the potential customers who have higher probability of purchasing the loan. This will increase the success ratio while at the same time reduce the cost of the campaign. The data file contains data on 5000 customers. The data include customer demographic information (age, income, etc.), the customer's relationship with the bank (mortgage, securities account, etc.), and the customer response to the last personal loan campaign (Personal Loan). Among these 5000 customers, only 480 (= 9.6%) accepted the personal loan that was offered to them in the earlier campaign.
The classification goal is to predict the likelihood of a liability customer buying personal loans.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
df=pd.read_csv(r'C:\Users\$ubhajit\Downloads\Bank_Personal_Loan_Modelling-1.csv')
#Making a copy of the dataset.
df_copy=df.copy()
df.columns
#Replace spaces with underscores and lower-case the column names
df.columns = [i.replace(' ', '_').lower() for i in df.columns]
df.columns
df.shape
There are 5000 rows and 14 columns in the dataset.
df.info()
df.head()
df.tail()
loans_counts=df['personal_loan'].value_counts().to_frame()
loans_counts
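Passing normalize=True to value_counts gives the class shares directly, which makes the ~9.6% positive rate mentioned above easy to confirm. A minimal sketch on synthetic labels standing in for the personal_loan column:

```python
import pandas as pd

# Hypothetical stand-in for personal_loan: 480 acceptors out of 5000 customers
labels = pd.Series([1] * 480 + [0] * 4520, name='personal_loan')

counts = labels.value_counts()                # absolute counts per class
rates = labels.value_counts(normalize=True)   # fraction per class

print(counts[1], rates[1])  # 480 positives, i.e. a 0.096 positive rate
```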
df.describe().T
df.nunique()
features = ['age', 'experience', 'family','income']
df[features].hist(figsize=(12, 8))
#!pip install pandas_profiling  (note: the package has since been renamed to ydata-profiling)
import pandas_profiling
df.profile_report()
#Dropping 'id' and 'zip_code' as they are identifiers with no predictive value
df.drop('id',axis=1,inplace=True)
df.drop('zip_code',axis=1,inplace=True)
df_edu=pd.crosstab(df['education'],df['personal_loan'])
df_edu.div(df_edu.sum(1).astype(float),axis=0).plot(kind='bar', stacked=True)
print('Cross tabulation can be given as:','\n',df_edu)
print('Cross tabulation in percentages:','\n',df_edu.div(df_edu.sum(1).astype(float),axis=0))
df_family=pd.crosstab(df['family'],df['personal_loan'])
df_family.div(df_family.sum(1).astype(float),axis=0).plot(kind='bar', stacked=True)
print('Cross tabulation can be given as:','\n',df_family)
print('Cross tabulation in percentages:','\n',df_family.div(df_family.sum(1).astype(float),axis=0))
df_cd=pd.crosstab(df['cd_account'],df['personal_loan'])
df_cd.div(df_cd.sum(1).astype(float),axis=0).plot(kind='bar', stacked=True)
print('Cross tabulation can be given as:','\n',df_cd)
print('Cross tabulation in percentages:','\n',df_cd.div(df_cd.sum(1).astype(float),axis=0))
df_credit=pd.crosstab(df['creditcard'],df['personal_loan'])
df_credit.div(df_credit.sum(1).astype(float),axis=0).plot(kind='bar', stacked=True)
print('Cross tabulation can be given as:','\n',df_credit)
print('Cross tabulation in percentages:','\n',df_credit.div(df_credit.sum(1).astype(float),axis=0))
df_online=pd.crosstab(df['online'],df['personal_loan'])
df_online.div(df_online.sum(1).astype(float),axis=0).plot(kind='bar', stacked=True)
print('Cross tabulation can be given as:','\n',df_online)
print('Cross tabulation in percentages:','\n',df_online.div(df_online.sum(1).astype(float),axis=0))
df_securities=pd.crosstab(df['securities_account'],df['personal_loan'])
df_securities.div(df_securities.sum(1).astype(float),axis=0).plot(kind='bar', stacked=True)
print('Cross tabulation can be given as:','\n',df_securities)
print('Cross tabulation in percentages:','\n',df_securities.div(df_securities.sum(1).astype(float),axis=0))
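The six crosstab-and-plot blocks above all repeat the same pattern, so they can be factored into one helper and a loop. A sketch on a tiny synthetic frame (the helper name and the sample values are illustrative; column names mirror the dataset):

```python
import pandas as pd

# Tiny synthetic frame standing in for the bank data
df = pd.DataFrame({
    'education':     [1, 1, 2, 2, 3, 3],
    'cd_account':    [0, 0, 0, 1, 1, 0],
    'personal_loan': [0, 1, 0, 1, 1, 0],
})

def loan_rate_table(frame, col, target='personal_loan'):
    """Cross-tabulate col against target and add row-wise shares."""
    ct = pd.crosstab(frame[col], frame[target])
    pct = ct.div(ct.sum(axis=1), axis=0)  # normalise each row to fractions
    return ct, pct

for col in ['education', 'cd_account']:
    ct, pct = loan_rate_table(df, col)
    print(f'--- {col} ---')
    print(ct)
    print(pct)
```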
df.groupby('personal_loan')['ccavg'].mean().plot(kind='bar')
df.groupby('personal_loan')['income'].mean().plot(kind='bar')
df.groupby('personal_loan')['experience'].mean().plot(kind='bar')
#rechecking missing value:
df.isnull().sum()
As we saw earlier in the univariate analysis, mortgage contains outliers, and they must be treated because they distort the distribution of the data: the bulk of the mortgage values is heavily right-skewed. One way to flag and remove these outliers is the z-score.
# importing zscore from the scipy.stats library
from scipy import stats
df['Mortgage_Zscore']=np.abs(stats.zscore(df['mortgage']))
df=df[df['Mortgage_Zscore']<3]
df.drop('Mortgage_Zscore',axis=1,inplace=True)
df.shape
Here I kept only the rows whose z-score is less than 3 (the threshold can be varied). This dropped just over 100 rows containing outliers, and now I can start with the model building.
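The z-score cut-off of 3 assumes roughly normal data, which a heavily right-skewed column like mortgage is not; the IQR (Tukey) fence is a robust alternative worth knowing. A sketch on synthetic right-skewed values (the data here are made up for illustration):

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed column standing in for 'mortgage'
rng = np.random.default_rng(0)
mortgage = pd.Series(np.concatenate([rng.exponential(50, 990), [600] * 10]))

q1, q3 = mortgage.quantile(0.25), mortgage.quantile(0.75)
iqr = q3 - q1
upper = q3 + 1.5 * iqr            # classic Tukey upper fence
kept = mortgage[mortgage <= upper]
print(len(mortgage) - len(kept), 'rows flagged as outliers')
```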
#Importing library for spliting data
from sklearn.model_selection import train_test_split
#Importing model
from sklearn.linear_model import LogisticRegression
#Importing library to check accuracy_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_curve,auc
# set of independent variable
X=df.drop('personal_loan',axis=1)
# set of dependent variable
y=df['personal_loan']
#Splitting the data into train and test sets in the ratio of 70:30
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.30,random_state=1)
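Because only about 9.6% of customers accepted the loan, a plain random split can leave the test set with a noticeably different positive rate; passing stratify=y preserves the class ratio in both halves. A sketch with synthetic imbalanced labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: ~10% positives, as in the loan dataset
X = np.arange(1000).reshape(-1, 1)
y = np.array([1] * 100 + [0] * 900)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y)

print(y_tr.mean(), y_te.mean())  # both close to 0.10
```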
#Importing libraries
from sklearn import preprocessing
#Get column names
column_name= df.columns
#Create scaler object
scaler=preprocessing.StandardScaler()
#Fit the scaler on the training data only, then apply the same transform to the test data
scaled_X_train=scaler.fit_transform(X_train)
scaled_X_test=scaler.transform(X_test)  # transform only: refitting on test data would leak information
scaled_X_train
scaled_X_test
#Note: preprocessing.normalize rescales each row to unit norm; normalized_X is not used by the models below
normalized_X = preprocessing.normalize(X)
logreg=LogisticRegression()
#Fitting the model into training dataset
logreg.fit(scaled_X_train,y_train)
y_pred=logreg.predict(scaled_X_test)
print(classification_report(y_test,y_pred))
print(accuracy_score(y_test,y_pred))
print(confusion_matrix(y_test,y_pred))
So the logistic regression model achieves about 95% accuracy.
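Accuracy alone is misleading at a 9.6% positive rate (always predicting "no loan" would already score around 90%), so recall on the positive class is worth reading straight off the confusion matrix. A sketch with hypothetical labels and predictions:

```python
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical labels and predictions for illustration
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_hat  = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()
recall = tp / (tp + fn)  # share of actual loan-takers the model catches
print(recall, recall_score(y_true, y_hat))  # both 0.75
```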
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
logreg_prob=logreg.predict_proba(scaled_X_test)
fpr,tpr,threshold=roc_curve(y_test,logreg_prob[:,1])
roc_auc=auc(fpr,tpr)
print('Area under the ROC curve :%f'%roc_auc)
The area under the ROC curve is about 0.95.
plt.figure()
plt.plot(fpr, tpr, label='Logistic Regression (area = %f)' % roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('Log_ROC')
plt.show()
The dotted line represents the ROC curve of a purely random classifier.
Let us see whether a decision tree, with either entropy or Gini as the splitting criterion, can push the recall higher.
from sklearn.tree import DecisionTreeClassifier
dec_treeE=DecisionTreeClassifier(criterion='entropy',random_state=1)
dec_treeE.fit(scaled_X_train,y_train)
y_pred=dec_treeE.predict(scaled_X_test)
print(classification_report(y_test,y_pred))
print(accuracy_score(y_test,y_pred))
print(confusion_matrix(y_test,y_pred))
dec_treeE_prob=dec_treeE.predict_proba(scaled_X_test)
fpr1,tpr1,threshold1=roc_curve(y_test,dec_treeE_prob[:,1])
roc_auc1=auc(fpr1,tpr1)
print('Area under the ROC curve :%f'%roc_auc1)
plt.figure()
plt.plot(fpr1, tpr1, label='Decision Tree Entropy(area = %f)' % roc_auc1)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('DT_Entropy_ROC')
plt.show()
from sklearn.tree import DecisionTreeClassifier
dec_treeG=DecisionTreeClassifier(criterion='gini',random_state=1)
dec_treeG.fit(scaled_X_train,y_train)
y_pred=dec_treeG.predict(scaled_X_test)
print(classification_report(y_test,y_pred))
print(accuracy_score(y_test,y_pred))
print(confusion_matrix(y_test,y_pred))
dec_treeG_prob=dec_treeG.predict_proba(scaled_X_test)
fpr2,tpr2,threshold2=roc_curve(y_test,dec_treeG_prob[:,1])
roc_auc2=auc(fpr2,tpr2)
print('Area under the ROC curve :%f'%roc_auc2)
plt.figure()
plt.plot(fpr2, tpr2, label='Decision Tree GINI(area = %f)' % roc_auc2)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('DT_Gini_ROC')
plt.show()
The area under the ROC curve is much smaller in this case too, the reason being overfitting of the training data. Let us check the accuracy score on the training data for both trees.
print(dec_treeE.score(scaled_X_train,y_train))
print(dec_treeG.score(scaled_X_train,y_train))
Now we will prune the trees via their hyperparameters (max_depth, min_samples_leaf) to reduce the overfitting.
from sklearn.tree import DecisionTreeClassifier
dec_tree_prunE=DecisionTreeClassifier(criterion='entropy',max_depth=4,min_samples_leaf=7,random_state=2)
dec_tree_prunE.fit(scaled_X_train,y_train)
y_pred=dec_tree_prunE.predict(scaled_X_test)
print(classification_report(y_test,y_pred))
print(accuracy_score(y_test,y_pred))
print(confusion_matrix(y_test,y_pred))
dec_tree_prunE_prob=dec_tree_prunE.predict_proba(scaled_X_test)
fpr3,tpr3,threshold3=roc_curve(y_test,dec_tree_prunE_prob[:,1])
roc_auc3=auc(fpr3,tpr3)
print('Area under the ROC curve :%f'%roc_auc3)
We can see the difference now: about 98% accuracy with an 88% recall, and the AUC is around 0.99, which is fairly good. Let us check the same for the Gini criterion, and then we can conclude the decision tree results.
from sklearn.tree import DecisionTreeClassifier
dec_tree_prunG=DecisionTreeClassifier(criterion='gini',max_depth=4,min_samples_leaf=7,random_state=2)
dec_tree_prunG.fit(scaled_X_train,y_train)
y_pred=dec_tree_prunG.predict(scaled_X_test)
print(classification_report(y_test,y_pred))
print(accuracy_score(y_test,y_pred))
print(confusion_matrix(y_test,y_pred))
dec_tree_prunG_prob=dec_tree_prunG.predict_proba(scaled_X_test)
# use distinct variable names so the pruned-entropy results above are not overwritten
fpr_g,tpr_g,threshold_g=roc_curve(y_test,dec_tree_prunG_prob[:,1])
roc_auc_g=auc(fpr_g,tpr_g)
print('Area under the ROC curve :%f'%roc_auc_g)
We got an 83% recall with 97% accuracy, and the AUC is approximately 0.98, which is also fairly good.
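The max_depth=4 and min_samples_leaf=7 values above were picked by hand; cross-validated grid search can choose them more systematically, and it can optimise recall directly. A sketch on synthetic data (the dataset and parameter grid are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced classification problem for illustration
X, y = make_classification(n_samples=500, weights=[0.9], random_state=1)

grid = GridSearchCV(
    DecisionTreeClassifier(criterion='entropy', random_state=2),
    param_grid={'max_depth': [3, 4, 5, 6], 'min_samples_leaf': [3, 5, 7]},
    scoring='recall',  # optimise the metric we actually care about here
    cv=5)
grid.fit(X, y)
print(grid.best_params_)
```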
from sklearn.naive_bayes import GaussianNB
naive_model=GaussianNB()
naive_model.fit(scaled_X_train,y_train)
y_pred=naive_model.predict(scaled_X_test)
print(classification_report(y_test,y_pred))
print(accuracy_score(y_test,y_pred))
print(confusion_matrix(y_test,y_pred))
naive_model_prob=naive_model.predict_proba(scaled_X_test)
fpr4,tpr4,threshold4=roc_curve(y_test,naive_model_prob[:,1])
roc_auc4=auc(fpr4,tpr4)
print('Area under the ROC curve :%f'%roc_auc4)
We got a 59% recall with an 89% accuracy, and the AUC is approximately 0.92.
The pruned Decision Tree model is the best: its train and test accuracies are almost the same, and its precision and recall are good. Its confusion matrix is also better than those of the Logistic Regression and Naive Bayes models. Naive Bayes performs poorly in this case.
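The three-way comparison above can also be made side by side by fitting the models in one loop over a shared split. A sketch on synthetic data (the dataset is a stand-in; the metric values will differ from the real ones above):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data for illustration
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

scaler = StandardScaler().fit(X_tr)              # fit on train only
X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)

models = {
    'LogReg': LogisticRegression(),
    'Tree (pruned)': DecisionTreeClassifier(max_depth=4, min_samples_leaf=7,
                                            random_state=2),
    'GaussianNB': GaussianNB(),
}
results = {}
for name, model in models.items():
    model.fit(X_tr_s, y_tr)
    pred = model.predict(X_te_s)
    prob = model.predict_proba(X_te_s)[:, 1]
    results[name] = (accuracy_score(y_te, pred),
                     recall_score(y_te, pred),
                     roc_auc_score(y_te, prob))

for name, (acc, rec, area) in results.items():
    print(f'{name}: acc={acc:.3f} recall={rec:.3f} auc={area:.3f}')
```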